Jitter buffer control based on monitoring of delay jitter and conversational dynamics

ABSTRACT

Some implementations involve analyzing audio packets received during a time interval that corresponds with a conversation analysis segment to determine network jitter dynamics data and conversational interactivity data. The network jitter dynamics data may provide an indication of jitter in a network that relays the audio data packets. The conversational interactivity data may provide an indication of interactivity between participants of a conversation represented by the audio data. A jitter buffer size may be controlled according to the network jitter dynamics data and the conversational interactivity data. The time interval may include a plurality of talkspurts.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/988,571, filed Aug. 7, 2020, which is a continuation of U.S. patent application Ser. No. 15/302,945, filed Oct. 7, 2016, now U.S. Pat. No. 10,742,531, which is the 371 national stage of PCT Application No. PCT/US2015/025078, filed Apr. 9, 2015, which claims priority to Chinese Patent Application No. 201410152754.9, filed Apr. 16, 2014 and U.S. Provisional Patent Application No. 61/989,340, filed May 6, 2014, each of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates to the processing of audio signals. In particular, this disclosure relates to processing audio signals for telecommunications, including but not limited to processing audio signals for teleconferencing or video conferencing.

BACKGROUND

Voice transmission over packet networks is subject to delay variation, commonly known as jitter. Jitter may, for example, be measured in terms of inter-arrival time (IAT) variation or packet delay variation (PDV). IAT variation may be measured according to the receive time difference of adjacent packets. PDV may, for example, be measured by reference to time intervals from a datum or “anchor” packet receive time. In Internet Protocol (IP)-based networks, a fixed delay can be attributed to algorithmic, processing and propagation delays due to material and distance, whereas a variable delay may be caused by the fluctuation of IP network traffic, different transmission paths over the Internet, etc.
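
For illustration only (the following sketch is not part of the original disclosure; all names and values are hypothetical), the two jitter measures described above may be computed from packet timestamps, assuming a nominal 20 ms send interval and using the first packet as the anchor:

```python
# Hypothetical sketch: measuring jitter as IAT variation and as PDV.

def iat_variation(recv_times, send_interval=0.02):
    """IAT variation: receive-time difference of adjacent packets,
    compared against an assumed nominal 20 ms send interval."""
    return [(recv_times[i] - recv_times[i - 1]) - send_interval
            for i in range(1, len(recv_times))]

def pdv(recv_times, send_times):
    """PDV measured against a datum ("anchor") packet: here the first
    packet's transit time serves as the reference delay."""
    anchor_delay = recv_times[0] - send_times[0]
    return [(r - s) - anchor_delay for r, s in zip(recv_times, send_times)]

# Packets sent every 20 ms; arrival times show jitter.
send = [0.00, 0.02, 0.04, 0.06]
recv = [0.10, 0.121, 0.139, 0.165]
print(iat_variation(recv))  # deviation of each gap from 20 ms
print(pdv(recv, send))      # deviation of each delay from the anchor delay
```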

VoIP (voice over Internet Protocol) receivers generally rely on a “jitter buffer” to counter the negative impact of jitter. By introducing an additional delay between the time a packet of audio data is received and the time that the packet is reproduced, a jitter buffer aims at transforming the uneven flow of arriving packets into a regular flow of packets, such that delay variations will not cause perceptual sound quality degradation to the end users. Voice communication is highly delay sensitive. According to ITU Recommendation G.114, for example, one-way delay should be kept below 150 ms for normal conversation, with above 400 ms being considered unacceptable. Therefore, the additional delay added by a jitter buffer needs to be small enough to avoid causing perceptual sound quality degradation. Unfortunately, a small jitter buffer will lead to more frequent packet loss when packets arrive later than expected due to network delays.

SUMMARY

According to some implementations described herein, a method may involve receiving audio data. The audio data may include audio packets received at actual packet arrival times during a time interval, which may correspond with a conversation analysis segment. The conversation analysis segment may include a plurality of talkspurts. The method may involve analyzing the audio data of the conversation analysis segment to determine network jitter dynamics data and conversational interactivity data. The network jitter dynamics data may provide an indication of jitter in a network that relays the audio data packets. The conversational interactivity data may provide an indication of interactivity between participants of a conversation represented by the audio data. The method may involve controlling a jitter buffer size according to both the network jitter dynamics data and the conversational interactivity data.

Analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of packet delay variation (PDV) or inter-arrival time (IAT) variation based, at least in part, on the actual packet arrival times. Determining PDV may involve comparing expected packet arrival times with the actual packet arrival times.

According to some implementations, analyzing the audio data may involve determining percentile ranges of packet delay times. Determining the network jitter dynamics data may involve determining an inter-percentile range of packet delay corresponding to a difference between a first packet delay time of a first percentile range and a second packet delay time of a second percentile range. In some examples, analyzing the audio data may involve determining a range of packet delay times according to order statistics of packet delay variation. The range of packet delay times may include shortest packet delay times, median packet delay times and longest packet delay times. Determining the network jitter dynamics data may involve determining a difference between one of the largest packet delay times and one of the median packet delay times. In some implementations, analyzing the audio data to determine the network jitter dynamics data may involve determining a delay spike presence probability and/or a delay spike intensity.

In some examples, analyzing the audio data to determine the conversational interactivity data may involve determining single-talk times during which only a single conversational participant may be speaking, double-talk times during which two or more conversational participants may be speaking and mutual silent times during which no conversational participant may be speaking. Analyzing the audio data to determine the conversational interactivity data may involve determining at least one of a rate of speaker alternation or a speaker interruption rate.

Some methods may involve receiving a speaker mute indication and/or a presentation indication. Determining the conversational interactivity data may involve determining conversational interactivity according to at least one of the speaker mute indication or the presentation indication.

In some implementations, analyzing the audio data to determine the conversational interactivity data may involve determining a conversational interactivity measure (CIM). The CIM may, for example, be based on heuristic rules and/or conversational relative entropy.

For example, the CIM may be based, at least in part, on heuristic rules that involve the application of a threshold for a rate of speaker alternation, a threshold for single-talk times during which only a single conversational participant may be speaking, a threshold for double-talk times during which two or more conversational participants may be speaking and/or a threshold for mutual silent times during which no conversational participant may be speaking.

In some implementations, the CIM may be based at least in part on conversational relative entropy. The conversational relative entropy may be determined, at least in part, according to probabilities of conversational states. The conversational state probabilities may include the probabilities of single-talk times during which only a single conversational participant may be speaking, of double-talk times during which two or more conversational participants may be speaking and of mutual silent times during which no conversational participant may be speaking.

According to some implementations, determining the conversational interactivity data may involve analyzing the conversational activity of only a single conversational participant. For example, analyzing the conversational activity of the single conversational participant may involve determining whether the single conversational participant is talking or not talking. Controlling the jitter buffer size may involve setting the jitter buffer to a relatively smaller size when the single conversational participant is talking and setting the jitter buffer to a relatively larger size when the single conversational participant is not talking.

In some implementations, controlling the jitter buffer size may involve setting a jitter buffer to a relatively larger size when the network jitter dynamics data indicates more than a threshold amount of network jitter. For example, controlling the jitter buffer size may involve setting a jitter buffer for a first conversational participant to a relatively larger size when the network jitter dynamics data indicates more than a threshold amount of network jitter or when the conversational interactivity data indicates less than a threshold amount of conversational participation by the first conversational participant.

According to some implementations, controlling the jitter buffer size may involve setting a jitter buffer to a relatively smaller size when the network jitter dynamics data indicates less than a threshold amount of network jitter or when the conversational interactivity data indicates at least a threshold amount of conversational interactivity. In some examples, controlling the jitter buffer size may involve setting a jitter buffer for a first conversational participant to a relatively smaller size when the network jitter dynamics data indicates less than a threshold amount of network jitter or when the conversational interactivity data indicates at least a threshold amount of conversational participation by the first conversational participant. In some examples, controlling the jitter buffer size may involve assigning a relatively smaller weighting to the network jitter dynamics data and assigning a relatively larger weighting to the conversational interactivity data.

According to some implementations, controlling the jitter buffer size may involve setting a jitter buffer size according to one of at least three jitter buffer control modes. For example, the jitter buffer control modes may include a peak mode, a low-loss mode and a normal mode. In some such implementations, each jitter buffer control mode may correspond to a jitter buffer size. However, in some examples, each jitter buffer control mode may correspond to a range of jitter buffer sizes.

At least one of the jitter buffer control modes may correspond to network jitter dynamics data indicating at least a threshold amount of network jitter and conversational interactivity data indicating at least a threshold amount of conversational interactivity. At least one of the jitter buffer control modes may correspond to network jitter dynamics data indicating at least a threshold amount of network jitter and conversational interactivity data indicating less than a threshold amount of conversational interactivity. At least one of the jitter buffer control modes may correspond to network jitter dynamics data indicating less than a threshold amount of network jitter and conversational interactivity data indicating at least a threshold amount of conversational interactivity. At least one of the jitter buffer control modes may correspond to network jitter dynamics data indicating less than a threshold amount of network jitter and conversational interactivity data indicating less than a threshold amount of conversational interactivity.

According to some implementations, these methods and/or other methods disclosed herein may be implemented via one or more non-transitory media having software stored thereon. The software may include instructions for controlling one or more devices to perform such methods, at least in part.

At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system, a memory system that may include a jitter buffer, and a logic system. The logic system may be capable of receiving audio data via the interface system. The audio data may include audio packets received at actual packet arrival times during a time interval that may correspond with a conversation analysis segment.

The interface system may include a network interface, an interface between the logic system and the memory system and/or an external device interface. The logic system may include at least one of a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.

The logic system may be capable of analyzing the audio data of the conversation analysis segment to determine network jitter dynamics data and conversational interactivity data. The network jitter dynamics data may provide an indication of jitter in a network that relays the audio data packets. The conversational interactivity data may provide an indication of interactivity between participants of a conversation represented by the audio data. The logic system may be capable of controlling a jitter buffer size according to the network jitter dynamics data and the conversational interactivity data. The time interval may correspond with a conversation analysis segment that includes a plurality of talkspurts.

In some implementations, analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of packet delay variation (PDV) or inter-arrival time (IAT) variation by comparing expected packet arrival times with the actual packet arrival times. In some examples, analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of a delay spike presence probability or a delay spike intensity.

According to some implementations, analyzing the audio data to determine the conversational interactivity data may involve determining single-talk times during which only a single conversational participant may be speaking, double-talk times during which two or more conversational participants may be speaking and mutual silent times during which no conversational participant may be speaking. Analyzing the audio data to determine the conversational interactivity data may involve determining a conversational interactivity measure (CIM) based on at least one of heuristic rules or conversational relative entropy.

Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a diagram schematically illustrating an example of a voice communication system in which embodiments of the application can be applied.

FIG. 1B is a diagram schematically illustrating another example of a voice communication system in which aspects of the application can be implemented.

FIG. 2 is a flow diagram that illustrates blocks of some jitter buffer control methods provided herein.

FIG. 3 provides an example of a two-party conversational model that illustrates some examples of conversational states.

FIG. 4 is a flow diagram that illustrates blocks of some jitter buffer control methods provided herein.

FIG. 5 is a block diagram that provides examples of components of an apparatus capable of implementing various aspects of this disclosure.

FIG. 6 is a block diagram that provides examples of components of an audio processing apparatus.

Like reference numbers and designations in the various drawings indicate like elements.

DESCRIPTION OF EXAMPLE EMBODIMENTS

The following description is directed to certain implementations for the purposes of describing some innovative aspects of this disclosure, as well as examples of contexts in which these innovative aspects may be implemented. However, the teachings herein can be applied in various different ways. For example, while various implementations are described in terms of particular examples of audio data processing, the teachings herein are widely applicable to other known audio data processing implementations, as well as audio data processing implementations that may be introduced in the future.

Moreover, the described embodiments may be implemented in a variety of hardware, software, firmware, etc. For example, aspects of the present application may be embodied in a system, in a device (e.g., a cellular telephone, a portable media player, a personal computer, a server, a television set-top box, a digital video recorder or other media player), a method or a computer program product. Accordingly, aspects of the present application may take the form of a hardware embodiment, a software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining both software and hardware aspects. Such embodiments may be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present application may take the form of a computer program product embodied in one or more non-transitory media having computer readable program code embodied thereon. Such non-transitory media may, for example, include a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. Accordingly, the teachings of this disclosure are not intended to be limited to the implementations shown in the figures and/or described herein, but instead have wide applicability.

The terms “block” and “packet” are used synonymously herein. Accordingly, an “audio block” or a “block of audio data” will have the same meaning as an “audio packet” or a “packet of audio data.”

As used herein, the term “buffer” may refer to a region of a physical memory device used to temporarily store data, or to a logical or virtual data buffer that “points” to a location in a physical memory. A “jitter buffer” will generally refer to a logical or a physical buffer for storing received audio frames. Although a jitter buffer will generally be used to temporarily store encoded audio data prior to a decoding process, a jitter buffer may store various forms of audio packets or audio frames, depending on the specific implementation. Therefore, throughout the specification, the term “jitter buffer” shall be construed as including both a jitter buffer actually storing (or pointing to) audio frames and a jitter buffer actually storing (or pointing to) various forms of packets (blocks) which will subsequently be decoded into audio frames before being played out or being fed into components for further processing. The decoding process may not always be explicitly discussed in connection with buffering processes, although decoding will generally be performed prior to reproduction or “playback” of the audio data. Accordingly, the term “frame” as used herein should be broadly construed as including a frame already decoded from a packet, a frame still encoded in a packet, a packet itself including one or more frames, or more than one frame encoded in a packet or already decoded from the packet. In other words, in the context of the present application, processing involving a frame may also be construed as processing involving a packet, or as processing involving simultaneously more than one frame contained in a packet.

In the context of the present application, the meaning of the expression “at the same time” (or the like) includes but is not limited to the exact literal meaning, and shall be construed as “within the same time gap/interval of a predefined granularity.” In the present application, for example, the predefined granularity may be the time gap between two consecutively-sent frames/packets (such a time gap may be referred to as a frame gap), or a network probing rate for checking packet arrivals, or a processing time granularity, but is not limited thereto. For example, one may quantize the arrival time by the frame duration/gap, e.g., 20 ms. Alternatively, or additionally, one may represent time as corresponding to an integer packet number. Similarly, in the context of the present application, a specific time point may, depending on the context, also mean a time gap of the predefined granularity. Further, when a specific time t_(i) (where i is an integer) is involved for a certain frame, it shall be understood as a time point; for clarity, assume that it indicates the time point when the frame starts on the sender side, or the time point when reception of the frame starts on the receiver side.
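
As a minimal illustration of this convention (hypothetical code, not part of the original text), an arrival time may be quantized to a 20 ms frame gap or represented as an integer packet number:

```python
FRAME_GAP = 0.02  # assumed 20 ms frame duration/gap

def quantize_to_frame(t_seconds):
    """Map a continuous arrival time onto the 20 ms grid."""
    return round(t_seconds / FRAME_GAP) * FRAME_GAP

def packet_number(t_seconds):
    """Alternatively, represent the time as an integer packet number."""
    return round(t_seconds / FRAME_GAP)

print(quantize_to_frame(0.131))  # ~0.14: same 20 ms interval as 0.139
print(packet_number(0.131))      # 7
```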

As mentioned above, a jitter buffer can counter the negative impact of network instability by temporarily storing received packets of audio data (also referred to herein as “audio packets”), which may correspond to voice data. In some implementations, the jitter buffer may store audio packets before the audio packets are provided to a decoder and subsequently reproduced, e.g., by speakers of a communication terminal.

Determining a suitable jitter buffer size can be challenging. A jitter buffer that is too small may cause an unacceptable number of audio packets to be dropped, particularly during times of delay spikes that may be caused, e.g., by increased network activity. However, a jitter buffer that is too large may lead to perceptual delays. Certain conversations, such as interactive conversations between two or more people, may require a relatively lower delay in order to avoid irritating the participants in a conversation. However, participants in other conversations, such as a presentation that is predominantly a one-way conversation, are generally more tolerant of an initial delay.

Accordingly, in some implementations described herein, a jitter buffer size may be controlled in order to treat these two types of conversations differently. Some such implementations provide context-aware jitter buffer management capable of dynamically controlling a jitter buffer's size based on network jitter dynamics data corresponding to the long-term network jitter context and on conversational interactivity data corresponding to the conversational context.

FIG. 1A is a diagram schematically illustrating an example of a voice communication system in which embodiments of the application can be applied. As illustrated in FIG. 1A, conversational participant A operates a communication terminal A, and conversational participant B operates a communication terminal B. Like other communication terminals shown and described herein, communication terminals A and B may, for example, include telephones, personal computers, mobile electronic devices (e.g., cellular telephones, smart phones, tablets, etc.) or the like. Communication terminals A and B may include components such as those described below (e.g., with reference to FIGS. 5 and 6).

During a voice communication session, such as a teleconference, conversational participant A and conversational participant B may talk to each other via their communication terminals A and B. In this example, the communication terminals A and B are capable of communicating via a data link 103. The data link 103 may be implemented as a point-to-point connection, as a communication network, etc.

In this example, communication terminals A and B are capable of performing VAD (Voice Activity Detection) on audio blocks of audio signals captured by one or more microphones. If a voice presence is detected in an audio block, corresponding processing (e.g., applying a gain suitable for voice data) may be performed on the audio block by a logic system of the communication terminal and the audio block may be transmitted to another conversational participant's communication terminal via the data link 103. If no voice presence is detected in an audio block, corresponding processing (e.g., applying a gain suitable for non-voice data) may be performed on the audio block by a logic system of the communication terminal and the audio block may be transmitted to another conversational participant's communication terminal via the data link 103.

In this example, communication terminals A and B are capable of indicating a silent time to the other conversational participant's communication terminal. As used herein, a “silent time” is a time during which a conversational participant is not speaking. During a “silent time,” a conversational participant's communication terminal may detect non-speech sounds, such as ambient noise. Audio data corresponding to such sounds may be processed and may be transmitted to one or more other communication terminals. In some implementations, a silent time may be indicated by transmitting silent time metadata (e.g., by setting a corresponding flag or bit), whereas in other implementations a silent time may be indicated by transmitting nothing during the time period corresponding to an audio block. In some implementations, silent time metadata may correspond with a conversational participant's activation of a “mute” control of a communication terminal.

In this implementation, communication terminals A and B are capable of establishing and controlling jitter buffers, which are represented as “JB” in FIGS. 1A and 1B. Here, communication terminals A and B are capable of receiving encoded audio data, e.g., as audio blocks, and storing them as entries in a jitter buffer. The entries may correspond to a time, e.g., a time at which the audio blocks are received. Audio blocks in the jitter buffer may be decoded and/or otherwise processed for reproduction by one or more speakers of the communication terminal. Reception of silent time metadata or nothing may cause corresponding empty entries in the jitter buffer. Communication terminals A and B may be capable of controlling a jitter buffer size as shown and described herein.

FIG. 1B is a diagram schematically illustrating another example of a voice communication system in which aspects of the application can be implemented. In this example, a voice conference may be conducted among conversational participants A, B and C.

As illustrated in FIG. 1B, conversational participant A operates a communication terminal A, conversational participant B operates a communication terminal B, and conversational participant C operates a communication terminal C. During a voice conference session, conversational participant A, conversational participant B, and conversational participant C may talk to each other through their communication terminals A, B, and C, respectively. The communication terminals illustrated in FIG. 1B may be capable of providing essentially the same functionality as those illustrated in FIG. 1A, from the perspective of conversational participants A, B and C. Although three communication terminals are illustrated in FIG. 1B, other implementations may involve more or fewer communication terminals.

However, in the example shown in FIG. 1B, the communication terminals A, B, and C are configured for communication with another device, which is a server in this example, through a common data link 113 or separate data links 113. The data link 113 may be implemented as a point-to-point connection or a communication network. The communication terminals A, B, and C may be capable of performing VAD and appropriate processing on audio blocks of the audio signal captured by the communication terminal, e.g., as described above.

In this implementation, communication terminals A, B and C are capable of indicating a silent time to the server. In some implementations, a silent time may be indicated by transmitting silent time metadata (e.g., by setting a corresponding flag or bit), whereas in other implementations a silent time may be indicated by transmitting nothing during the time period corresponding to an audio block. The communication terminals A, B and C may be capable of including a “timestamp” or similar time metadata with a transmitted audio packet, indicating the transmission time of the audio packet.

In this implementation, the server is also capable of establishing and controlling jitter buffers. In the example shown in FIG. 1B, the server has established jitter buffers JB_(A), JB_(B) and JB_(C), corresponding to each of the communication terminals A, B and C. For example, the server may be capable of controlling a jitter buffer size as disclosed herein. In this implementation, the server is capable of receiving the audio blocks transmitted by the communication terminals A, B and C and of storing them to entries in the jitter buffers JB_(A), JB_(B) and JB_(C) corresponding to the times of the audio blocks. For example, the server may be capable of storing the audio blocks to entries in the jitter buffers JB_(A), JB_(B) and JB_(C) corresponding to timestamps of the audio blocks. Reception of the silent time metadata (or nothing) may cause corresponding empty entries in the jitter buffers.

In this example, the server is also capable of mixing audio blocks corresponding to the same time, from each of the jitter buffers JB_(A), JB_(B) and JB_(C), into a mixed audio block. Copies of the mixed audio blocks may be transmitted to each of the communication terminals A, B, and C. The server may include one or more types of timestamps with each of the mixed audio blocks. In some instances, for example, communication terminals A and B may be sending audio packets to the server. The server may mix the audio packets (either in the time domain or the frequency domain) before sending the mixed packets to communication terminal C. Whether the server performs such mixing may depend on various factors, such as bandwidth, whether the server is configured for mixing, whether mono or multi-channel audio is desired for communication terminal C, etc.

Communication terminals A, B, and C are capable of establishing and controlling jitter buffers JB in this example. The communication terminals A, B, and C may receive the mixed audio blocks from the server and may store them to jitter buffer entries corresponding to the times of the mixed audio blocks. For example, the jitter buffer entries may correspond to a time at which the audio blocks are received. In each communication terminal, audio blocks in the jitter buffer may be decoded and reproduced by a speaker system of the communication terminal. Communication terminals A, B, and C may be capable of controlling a jitter buffer size as disclosed herein.

A jitter buffer may be used to transform an uneven flow of arriving audio packets into a regular flow of audio packets, such that delay variations will not cause perceptual quality degradation to conversational participants. Determining a suitable jitter buffer level generally involves a trade-off between average buffer delay and packet loss rate. Many statistic-based jitter buffer management (JBM) algorithms have been proposed, such as the inter-packet delay variation (IPDV) based JBM algorithms disclosed in United States Patent Publication No. 2009/0003369 and the histogram-based JBM algorithm disclosed in S. B. Moon, J. Kurose and D. Towsley, Packet Audio Playout Delay Adjustment: Performance Bounds and Algorithms, in Multimedia Systems (1998) 6:17-28.

However, the implementations disclosed herein provide alternative methods of jitter buffer control. Some implementations disclosed herein involve analyzing audio data to determine network jitter dynamics data and conversational interactivity data for context-aware jitter buffer control.

FIG. 2 is a flow diagram that illustrates blocks of some jitter buffer control methods provided herein. Method 200 may, for example, be performed (at least in part) by a server or another such device that is configured for communicating with communication terminals, such as described above with reference to FIG. 1B. However, some methods provided herein may be performed (at least in part) by a communication terminal. As with other methods described herein, the blocks of method 200 are not necessarily performed in the order indicated. Moreover, some implementations of method 200 (and other methods disclosed herein) may include more or fewer blocks than indicated or described.

In this example, method 200 begins with block 205, which involves receiving audio data during a time interval that corresponds with a conversation analysis segment. In this example, the audio data includes audio packets received at actual packet arrival times. The time interval may, for example, be a long-term or a short-term time interval that includes a plurality of talkspurts. As used herein, the term “talkspurt” corresponds to a continuous (or substantially continuous) segment of speech between “mutual silent times” of a conversation. Although audio packets corresponding to the mutual silent times may include background noise, etc., the term “mutual silent time” is used herein to mean a time during which no conversational participant is speaking. In some implementations, a packet or frame length may be on the order of tens of milliseconds (e.g., 20 ms) and a conversation analysis segment may be on the order of tens of seconds, e.g., 20 seconds.

Here, block 210 involves analyzing the audio data of the conversation analysis segment to determine network jitter dynamics data and conversational interactivity data. The network jitter dynamics data may, for example, provide an indication of jitter in a network that relays the audio data packets. The conversational interactivity data may provide an indication of interactivity between participants of a conversation represented by the audio data. In this implementation, block 215 involves controlling a jitter buffer size according to both the network jitter dynamics data and the conversational interactivity data.

Analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of packet delay variation (PDV) or inter-arrival time (IAT) variation based, at least in part, on the actual packet arrival times. The network jitter dynamics data may, for example, include an inter-percentile range of packet delay, a delay spike presence probability and/or a delay spike intensity. For example, in some implementations a PDV for a set number of packets may be determined as a difference between an actual packet arrival time and an expected packet arrival time.

For example, consider a conversation analysis segment consisting of M talkspurts and N audio packets. In this discussion, the following variables represent the corresponding quantities set forth below:

- t_(k) ^(i) represents the sender timestamp of the i-th packet in the k-th talkspurt;
- r_(k) ^(i) represents the receiver timestamp of the i-th packet in the k-th talkspurt;
- n_(k) represents the number of received packets in the k-th talkspurt;
- p_(k) ^(i) represents the playback timestamp of the i-th packet in the k-th talkspurt; and
- p_(k) ^(i)−t_(k) ^(i) represents the playback delay.

In some implementations, an indication of late arrival may be made according to whether the playback timestamp indicates that the time for reproduction/playback of the audio packet was earlier than the arrival time of the audio packet. For example, late arrival of an audio packet may be indicated in a binary form, such that an indication of late arrival is one or zero, e.g., as follows:

$I_{k}^{i} = \begin{cases} 1, & \text{if } p_{k}^{i} < r_{k}^{i} \\ 0, & \text{otherwise} \end{cases} \qquad (\text{Equation 1})$

In Equation 1, I_(k) ^(i) represents a late arrival indication for the i-th packet in the k-th talkspurt. However, in alternative implementations, a more granular late arrival indication may be used. For example, a late arrival indication may indicate a degree of lateness according to two or more time thresholds.
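
A minimal sketch of the indication of Equation 1, together with a hypothetical graded variant of the kind just mentioned (the threshold values are illustrative only):

```python
def late_arrival_indicator(p, r):
    """Equation 1: 1 if the playback timestamp p precedes the
    receiver timestamp r (the packet arrived too late), else 0."""
    return 1 if p < r else 0

def graded_lateness(p, r, thresholds=(0.02, 0.1)):
    """More granular variant: grade lateness against two thresholds,
    returning 0 (on time), 1 (slightly late) or 2 (very late)."""
    lateness = r - p
    return sum(lateness >= th for th in thresholds)

print(late_arrival_indicator(p=1.00, r=1.03))  # 1
print(graded_lateness(p=1.00, r=1.03))         # 1
```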

In some such implementations, a packet delay variation may be determined according to a time window that includes the past w audio packets that have been received, e.g., as follows:

$\Delta d_{n}^{i-w+1},\ \Delta d_{n}^{i-w+2},\ \ldots,\ \Delta d_{n}^{i} \qquad (\text{Equation 2})$

In Equation 2, Δd_(n) ^(i) represents the packet delay of the i-th audio packet, Δd_(n) ^(i−w+1) represents the packet delay of the audio packet received (w−1) packets prior to the i-th audio packet, Δd_(n) ^(i−w+2) represents the packet delay of the next audio packet received, etc. The packet delay may, for example, be determined according to PDV or IAT. The value of w also may be referred to herein as an “analysis window length.” The choice of w determines how fast a corresponding algorithm may adapt to delay variation. Accordingly, the value of w is subject to a trade-off between accuracy and responsiveness. In some implementations, w may be in the range of 200 to 1000.

According to some implementations, received audio packets may be sorted into a range of packet delay times, e.g., according to order statistics of packet delay variation. For example, order statistics for the above-referenced packet delay variations may be represented as ΔD¹, ΔD², . . . , ΔD^(w), wherein ΔD¹ ≤ ΔD² ≤ . . . ≤ ΔD^(w). In this example, ΔD¹ represents the smallest delay and ΔD^(w) represents the largest delay.

Accordingly, such a range may include shortest packet delay times, median packet delay times and longest packet delay times. Determining the network jitter dynamics data may involve determining a difference between the delay times, e.g., a difference between one of the largest packet delay times and one of the median packet delay times.

Some such implementations involve determining percentile ranges of packet delay times. In some such implementations, determining the network jitter dynamics data may involve determining an inter-percentile range of packet delay corresponding to a difference between a first packet delay time of a first percentile range (e.g., for one of the largest packet delay times) and a second packet delay time of a second percentile range (e.g., for one of the median packet delay times).

In some implementations, an inter-percentile range may be determined as follows:

ΔD^(r)−ΔD^(κ)  (Equation 3)

In some implementations, r may be selected such that ΔD^(r) represents one of the largest packet delay times, whereas κ may be selected such that ΔD^(κ) represents a delay time at or near the median of packet delay times. In one example, r and κ may be determined as follows:

r=round(0.995×w) and κ=round(0.5×w)   (Equations 4 and 5)

In Equations 4 and 5, “round” represents a process of rounding to a whole number. In some implementations, the values of r and κ may be determined according to empirical data based on experiments involving audio packets transmitted on different types of actual networks.
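
For illustration, a sketch of the inter-percentile range of Equations 3-5, assuming the analysis window of packet delays is available as a list (all names are hypothetical):

```python
import random

def inter_percentile_range(delays):
    """Sort the window into order statistics ΔD^1 <= ... <= ΔD^w and
    return ΔD^r − ΔD^κ (Equation 3), with r and κ per Equations 4 and 5."""
    ordered = sorted(delays)
    w = len(ordered)
    r = round(0.995 * w)      # near-largest order statistic (Equation 4)
    kappa = round(0.5 * w)    # median order statistic (Equation 5)
    return ordered[r - 1] - ordered[kappa - 1]  # 1-based ranks

# Example: a window of w = 200 delays, in packet intervals.
random.seed(0)
window = [max(0.0, random.gauss(1.0, 0.5)) for _ in range(200)]
print(inter_percentile_range(window))
```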

Determining the network jitter dynamics data may involve determining a delay spike intensity, e.g., based on the number of packets that had a delay beyond a threshold and the amount of delay for those packets, and/or determining a delay spike presence probability. For example, in some implementations, a delay spike presence probability may be determined as follows:

$p = \frac{\sum_{i=1}^{w} I_{n}^{i}}{w} \qquad (\text{Equation 6})$

In Equation 6, p represents a delay spike presence probability and I_(n) ^(i) represents a delay spike indicator. In some examples, the delay spike indicator of Equation 6 may be determined as follows:

$I_{n}^{i} = \begin{cases} 0, & \text{if } \Delta d_{n}^{i} < \xi_{th} \\ 1, & \text{otherwise} \end{cases} \qquad (\text{Equation 7})$

In Equation 7, ξ_(th) represents a delay spike threshold. In some implementations, the delay spike threshold may be in the range of 5 to 20 packet intervals, e.g., 10 packet intervals. For example, if an expected time interval between packets is 20 ms, a delay spike threshold of 10 would correspond to 200 ms.

Some implementations may involve determining an average delay spike intensity during a time interval. In some examples, the average delay spike intensity may be determined as follows:

$\lambda = \frac{\sum_{i=1}^{w} I_{n}^{i}\, \Delta d_{n}^{i}}{\sum_{i=1}^{w} I_{n}^{i}} \qquad (\text{Equation 8})$

In Equation 8, λ represents the average delay spike intensity for an analysis window length of w audio packets.
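
A sketch of Equations 6-8 (hypothetical code; delays are expressed in packet intervals, and the spike threshold of 10 follows the example above):

```python
SPIKE_THRESHOLD = 10  # ξ_th, in packet intervals (200 ms at a 20 ms gap)

def spike_indicators(delays):
    """Equation 7: flag each packet whose delay reaches ξ_th."""
    return [0 if d < SPIKE_THRESHOLD else 1 for d in delays]

def spike_presence_probability(delays):
    """Equation 6: fraction of packets in the window flagged as spikes."""
    ind = spike_indicators(delays)
    return sum(ind) / len(ind)

def average_spike_intensity(delays):
    """Equation 8: mean delay over the flagged (spike) packets only."""
    ind = spike_indicators(delays)
    count = sum(ind)
    return sum(i * d for i, d in zip(ind, delays)) / count if count else 0.0

delays = [1, 2, 1, 14, 1, 25, 2, 1]        # hypothetical window, w = 8
print(spike_presence_probability(delays))  # 0.25
print(average_spike_intensity(delays))     # 19.5
```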

Some implementations may involve combining more than one type of network jitter dynamics data. For example, some peak mode detection (PMD) implementations may involve combining more than one type of network jitter dynamics data in order to detect a peak mode of network jitter. In some examples, PMD will involve a “long-term” peak mode detection during a conversation analysis segment that includes a plurality of talkspurts.

In some examples, PMD may be based, at least in part, on inter-percentile range (IPR) calculations (e.g., as described above), delay spike intensity and/or delay spike presence probability. In one example, PMD is determined as follows:

$PMD = f(IPR, p, \lambda) = \begin{cases} 1, & \text{if } \left( IPR > IPR\_th \ \&\&\ p > p\_th \right) \ ||\ \left( IPR > IPR\_th \ \&\&\ \lambda > \lambda\_th \right) \\ 0, & \text{otherwise} \end{cases} \qquad (\text{Equation 9})$

In Equation 9, IPR_th represents a threshold of the inter-percentile range of order statistics, p_th represents a threshold of delay spike presence probability and λ_th represents a threshold of average delay spike intensity. In some implementations, IPR_th may be in the range of 5-9, e.g., 7. In some examples, p_th may be in the range of 0.03 to 0.07, e.g., 0.05, and λ_th may be in the range of 10-20, e.g., 15. Accordingly, in the example shown in Equation 9, a peak mode detection process will produce a binary value that corresponds to a “yes or no” determination. However, in alternative implementations, relatively more granular peak mode detection values may be determined. For example, in some implementations a range of two or more PMD values may correspond with varying degrees of network jitter.
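
A sketch of the binary rule of Equation 9, using the example thresholds quoted above (IPR_th = 7, p_th = 0.05, λ_th = 15); a more granular variant could return a score instead of a bit:

```python
IPR_TH, P_TH, LAMBDA_TH = 7, 0.05, 15

def peak_mode_detected(ipr, p, lam):
    """Equation 9: peak mode if IPR exceeds its threshold and either the
    spike presence probability or the average spike intensity does too."""
    return 1 if ipr > IPR_TH and (p > P_TH or lam > LAMBDA_TH) else 0

print(peak_mode_detected(ipr=8.2, p=0.06, lam=12))  # 1
print(peak_mode_detected(ipr=4.0, p=0.10, lam=30))  # 0
```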

As noted above, various implementations described herein involve controlling a jitter buffer size according to both network jitter dynamics data and conversational interactivity data. The foregoing discussion has focused mainly on determining various types of network jitter dynamics data. Following are various examples of determining conversational interactivity data.

In some implementations, determining conversational interactivity data may involve determining conversational states. FIG. 3 provides an example of a two-party conversational model that illustrates some examples of conversational states.

In FIG. 3, the shaded areas above the horizontal time axis t indicate times during which conversational participant “A” is talking, whereas the shaded areas below the time axis indicate times during which conversational participant “B” is talking. The areas labeled State A and State B indicate “single-talk” times during which a single conversational participant is speaking.

The area labeled State D indicates a “double-talk” time during which conversational participant A and conversational participant B are both speaking. In some instances, a conversation may include three or more participants. Accordingly, as used herein, a “double-talk” time means a time during which at least two conversational participants are speaking.

State M of FIG. 3 corresponds to a mutual silent time during which neither conversational participant A nor conversational participant B is speaking. In view of the fact that a conversation may sometimes include three or more participants, as used herein, a “mutual silent time” means a time during which no conversational participant is speaking.

However, determining conversational interactivity data may involve determining other types of conversational states. In some implementations, determining conversational interactivity data may involve determining at least one of a speaker alternation rate or a speaker interruption rate. A successful interruption of conversational participant “A” by conversational participant “B” may, for example, be inferred from the sequence State A/State D/State B. An unsuccessful interruption of conversational participant “A” may, for example, be determined from a sequence State A/State D/State A. Some implementations may involve sending or receiving a speaker mute indication or a presentation indication, e.g., as metadata associated with the audio packets. For example, a speaker mute indication or a presentation indication may correspond with receipt of input from a conversational participant, such as pressing a “mute” button, entering a code corresponding to a speaker mute indication or a presentation indication (e.g., via a keyboard or key pad), etc. Determining the conversational interactivity data may be based, at least in part, on the speaker mute indication and/or the presentation indication.
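
By way of illustration (hypothetical code, not part of the original disclosure), interruption events may be derived from a per-packet sequence of the FIG. 3 states 'A', 'B', 'D' and 'M':

```python
def collapse(states):
    """Remove consecutive duplicates, e.g. A A D D B -> A D B."""
    out = []
    for s in states:
        if not out or out[-1] != s:
            out.append(s)
    return out

def count_interruptions(states):
    """State A/State D/State B counts as a successful interruption of A
    by B; State A/State D/State A counts as an unsuccessful attempt."""
    seq = collapse(states)
    successful = unsuccessful = 0
    for x, y, z in zip(seq, seq[1:], seq[2:]):
        if y == 'D' and x in 'AB' and z in 'AB':
            if x == z:
                unsuccessful += 1
            else:
                successful += 1
    return successful, unsuccessful

states = list("AAADDBBBMMADDAAA")    # one state label per packet
print(count_interruptions(states))  # (1, 1)
```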

In some implementations, analyzing the audio data to determine the conversational interactivity data may involve determining a conversational interactivity measure (CIM) that corresponds with a degree of conversational interactivity. In some implementations, the CIM may be based on heuristic rules and/or conversational relative entropy.

In some examples, the CIM may be based, at least in part, on heuristic rules that involve the application of at least one of a threshold for a rate of speaker alternation, a threshold for single-talk times during which only a single conversational participant is speaking, a threshold for double-talk times during which two or more conversational participants are speaking or a threshold for mutual silent times during which no conversational participant is speaking. For heuristic rules, the degree of conversational interactivity may be based on a number of packets corresponding to single-talk times, a number of packets corresponding to mutual silent times, a number of packets corresponding to double-talk times and/or a number of times the speakers change (which may be referred to herein as “speaker alternation”) within a conversation analysis segment that includes a plurality of talkspurts.

In one such example, a CIM may be based on heuristic rules as follows:

$CIM(k) = \begin{cases} 0, & \text{if } \left( \frac{\lambda_{A}(k)}{\lambda_{B}(k)} \in \left[ 0, \frac{1}{thresh\_ST} \right] \cup \left[ thresh\_ST, +\infty \right) \right) \ \&\&\ \left( \lambda_{M}(k) \geq thresh\_MS \right) \ \&\&\ \left( SAR(k) \leq thresh\_SAR \right) \\ 1, & \text{otherwise} \end{cases} \qquad (\text{Equation 10})$

In the example of Equation 10, CIM(k) represents a conversational interactivity measure in a kth conversation analysis segment, thresh_ST represents a threshold for single-talk times, thresh_MS represents a threshold for mutual silent times and thresh_SAR represents a threshold for speaker alternation. Here, λ_A(k) and λ_B(k) may be understood as measures of single-talk time for conversational participants A and B, λ_M(k) as a measure of mutual silent time and SAR(k) as the speaker alternation rate in the kth conversation analysis segment. In some implementations, thresh_ST may be in the range of 8 to 12, e.g., 10. In some examples, thresh_MS may be in the range of 0.7-0.9 (e.g., 0.8) and thresh_SAR may be in the range of 0.003-0.007 (e.g., 0.005).

In the example shown in Equation 10, a conversational interactivity measure will be a binary value that may be thought of as corresponding to an “interactive” or “not interactive” determination. However, in alternative implementations, relatively more granular values of conversational interactivity may be determined. For example, in some implementations a range of two or more CIM values may correspond with varying degrees of conversational interactivity.
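
The following sketch applies one reading of Equation 10 (the three conditions, whose exact logical combination is not fully legible in the equation as reproduced, are treated here as a logical AND), with the example thresholds quoted above; λ_A, λ_B, λ_M and SAR are the per-segment statistics discussed in connection with Equation 10:

```python
THRESH_ST, THRESH_MS, THRESH_SAR = 10, 0.8, 0.005

def cim_heuristic(lam_a, lam_b, lam_m, sar):
    """Return 0 (not interactive) or 1 (interactive) for a segment."""
    ratio = lam_a / lam_b if lam_b else float('inf')
    one_sided = ratio <= 1 / THRESH_ST or ratio >= THRESH_ST
    if one_sided and lam_m >= THRESH_MS and sar <= THRESH_SAR:
        return 0
    return 1

# A one-sided, mostly silent segment with rare speaker alternation:
print(cim_heuristic(lam_a=0.15, lam_b=0.01, lam_m=0.8, sar=0.002))  # 0
# A balanced, frequently alternating segment:
print(cim_heuristic(lam_a=0.40, lam_b=0.35, lam_m=0.2, sar=0.010))  # 1
```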

In some implementations wherein the CIM is based at least in part on conversational relative entropy, the conversational relative entropy may be based, at least in part, on probabilities of conversational states. In some examples, the conversational states may correspond with the probabilities of single-talk times, of double-talk times and of mutual silent times. In one such example, the relative entropy of a conversation involving conversational participant A and conversational participant B may be determined as follows:

$E(k) = \sum_{I \in \{ A, B, M, D \}} - P_{I}(k) \log_{2}\left( \frac{P_{I}(k)}{Q_{I}(k)} \right) \qquad (\text{Equation 11})$

In Equation 11, E(k) represents a conversational relative entropy-based CIM in a kth conversation analysis segment and P_(I)(k) represents a probability of a conversational state in the kth conversation analysis segment. In this example, P_(A)(k) represents the probability of a single-talk time for conversational participant A, P_(B)(k) represents the probability of a single-talk time for conversational participant B, P_(M)(k) represents the probability of a mutual silent time during which neither conversational participant is speaking and P_(D)(k) represents the probability of a double-talk time during which both conversational participants are speaking.

In Equation 11, Q_(I)(k) represents the probability of each state corresponding with a presentation mode. Determining the probability of a presentation mode may be based, at least in part, on the relative amount of time that a particular conversational participant is speaking, on the length of a conversational participant's talkspurts relative to the length of other conversational participants' talkspurts, etc. As noted above, some implementations may involve sending or receiving a speaker mute indication or a presentation indication, e.g., as metadata associated with the audio packets. Therefore, determining the probability of a presentation mode may be based, at least in part, on a speaker mute indication and/or a presentation indication.
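
A sketch of Equation 11 (hypothetical code; the reference probabilities Q_I(k) below are invented for illustration):

```python
from math import log2

def conversational_relative_entropy(p, q):
    """Equation 11: E(k) = Σ_I −P_I(k)·log2(P_I(k)/Q_I(k)), I ∈ {A,B,M,D}.
    States with zero probability contribute nothing to the sum."""
    return sum(-p[i] * log2(p[i] / q[i]) for i in "ABMD" if p[i] > 0)

# Observed state probabilities vs. an assumed presentation-mode profile:
P = {'A': 0.25, 'B': 0.25, 'M': 0.30, 'D': 0.20}  # fairly interactive
Q = {'A': 0.85, 'B': 0.02, 'M': 0.12, 'D': 0.01}  # one-sided reference
print(conversational_relative_entropy(P, Q))
```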

In some examples, controlling a jitter buffer size according to network jitter dynamics data and conversational interactivity data (block 215 of FIG. 2) may involve setting a jitter buffer to a relatively larger size when the network jitter dynamics data indicates more than a threshold amount of network jitter and/or when the conversational interactivity data indicates less than a threshold amount of conversational interactivity. Block 215 may involve setting a jitter buffer to a relatively smaller size when the network jitter dynamics data indicates less than a threshold amount of network jitter and/or when the conversational interactivity data indicates at least a threshold amount of conversational interactivity. In some examples, block 215 may involve setting a jitter buffer to a relatively larger size when the network jitter dynamics data indicates more than a threshold amount of network jitter and/or when the conversational interactivity data indicates less than a threshold amount of conversational participation by a first conversational participant. In some examples, block 215 may involve setting a jitter buffer for the first conversational participant to a relatively smaller size when the network jitter dynamics data indicates less than a threshold amount of network jitter and/or when the conversational interactivity data indicates at least a threshold amount of conversational participation by the first conversational participant.

However, the network jitter dynamics data and the conversational interactivity data may or may not be given equal weight in the determination of jitter buffer size, depending on the implementation. In some implementations, for example, controlling the jitter buffer size according to both the network jitter dynamics data and the conversational interactivity data may involve assigning a relatively smaller weighting to the network jitter dynamics data and assigning a relatively larger weighting to the conversational interactivity data. For example, if the conversational interactivity data indicate that there is a high probability that a conversation is in a presentation mode, a relatively large jitter buffer size may be determined. Some implementations involve controlling a jitter buffer size according to one of at least three jitter buffer control modes, e.g., as described below. Each of the jitter buffer control modes may correspond to a jitter buffer size (or size range). In some implementations, a “small” jitter buffer size may be in the range of tens of milliseconds, e.g., 5 ms, 10 ms, 25 ms, 30 ms, 35 ms, 40 ms, 45 ms, 50 ms, 55 ms, etc. In some examples, a “large” jitter buffer size may be in the range of hundreds of milliseconds, e.g., 500 ms, 600 ms, 700 ms, 800 ms, 900 ms, 1000 ms, 1100 ms, 1200 ms, 1300 ms, 1400 ms, 1500 ms, etc. In some implementations, the jitter buffer size may be fixed until the network jitter dynamics data and/or the conversational interactivity data indicate a change in jitter buffer control mode. However, in other implementations, the jitter buffer size may vary adaptively within a size range during the operation of a single jitter buffer control mode, e.g., as discussed below.

FIG. 4 is a flow diagram that illustrates blocks of some jitter buffer control methods provided herein. Method 400 may, for example, be performed (at least in part) by a server or another such device that is configured for communicating with communication terminals, such as described above with reference to FIG. 1B. However, some methods provided herein may be performed (at least in part) by a communication terminal. As with other methods described herein, the blocks of method 400 are not necessarily performed in the order indicated. Moreover, some implementations of method 400 may include more or fewer blocks than indicated or described.

Here, method 400 begins with block 405, which involves receiving audio packets of a conversation analysis segment. In block 410, a conversational interactivity analysis is performed, which may be a long-term conversational interactivity analysis. As used herein, a “long-term” time interval includes at least a plurality of talkspurts. If it is determined in block 415 that the conversational interactivity analysis indicates a presentation mode, the process continues to block 440, in which a long delay/low loss (relatively large) jitter buffer mode is determined. In this example, the long delay/low loss mode corresponds with a jitter buffer size, or at least with a jitter buffer size range. In this example, therefore, no peak mode detection is necessary for the jitter buffer size (or at least a jitter buffer size range) to be determined. However, the jitter buffer size may vary adaptively during a long delay/low loss mode. In some implementations, during a long delay/low loss mode the parameters that can control a tradeoff between packet loss rate and jitter buffer delay may be tuned to reduce packet loss at the cost of increasing the buffer delay.

However, if it is determined in block 415 that the conversational interactivity analysis does not indicate a presentation mode, the process continues to peak mode detection block 420, which involves analyzing the audio packets of the conversation analysis segment to determine network jitter dynamics data. In this example, block 425 involves determining whether a peak mode (corresponding to high network jitter) is present, based on the network jitter dynamics data determined in block 420.

If it is determined in block 425 that the network jitter dynamics data indicates a peak mode, the process continues to block 435, in which a peak mode jitter buffer size (or size range) is determined. The peak mode jitter buffer size may be a relatively large jitter buffer size. In some implementations, the peak mode jitter buffer size may be the same size as the long delay/low loss jitter buffer size corresponding to block 440, whereas in other implementations the peak mode jitter buffer size may be smaller than the jitter buffer size corresponding to block 440. However, the jitter buffer size may vary adaptively during a peak mode. In some implementations, during a peak mode the maximum talkspurt start delay will be increased. In some examples, asymmetric “attack” and “decay” processes of jitter buffer control may be implemented during a peak mode. For example, in some peak mode implementations, an “attack” process of jitter buffer control can allow the jitter buffer size to increase quickly in response to instantaneous spike jitter. In some peak mode examples, the jitter buffer size may be controlled to reduce the packet loss rate when there is a bursty arrival of delayed packets after a delay spike.

If it is determined in block 425 that the network jitter dynamics data does not indicate a peak mode, the process continues to block 430, in which a “normal mode” jitter buffer size (or size range) is determined. In this example, the “normal mode” jitter buffer size range is smaller than the peak mode jitter buffer size range and smaller than the long delay/low loss jitter buffer size range. However, the jitter buffer size may vary adaptively during a normal mode. For example, the jitter buffer size may vary adaptively according to methods known to those of skill in the art. Method 400 may be performed on an ongoing basis, such that additional audio packets may be received and analyzed after a jitter buffer size is determined in block 430, 435 or 440.
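
A minimal sketch of the FIG. 4 decision flow (mode names and the example target sizes are illustrative only; in practice each mode may map to an adaptive size range, as discussed above):

```python
def select_jb_mode(is_presentation, is_peak_mode):
    """Blocks 415/425 of method 400: interactivity is consulted first;
    peak mode detection matters only outside presentation mode."""
    if is_presentation:   # block 415 -> block 440
        return "long_delay_low_loss"
    if is_peak_mode:      # blocks 420/425 -> block 435
        return "peak"
    return "normal"       # block 430

EXAMPLE_TARGET_MS = {"long_delay_low_loss": 1000, "peak": 500, "normal": 40}

mode = select_jb_mode(is_presentation=False, is_peak_mode=True)
print(mode, EXAMPLE_TARGET_MS[mode])  # peak 500
```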

In some implementations, controlling the jitter buffer size may involve setting a jitter buffer size according to one of at least three jitter buffer control modes. Each of the jitter buffer control modes may correspond to a jitter buffer size (or size range). One example is shown in Table 1, below:

TABLE 1

Algorithm Configurations

isPresentationMode   isPeakMode   Jitter buffer profile
1                    1            Long delay (low loss) mode
1                    0            Long delay (low loss) mode
0                    1            Tuned in peak mode
0                    0            Normal mode

In this example, the largest of the jitter buffer sizes (or size ranges) corresponds to network jitter dynamics data indicating more than a threshold amount of network jitter ("isPeakMode") and conversational interactivity data indicating less than a threshold amount of conversational interactivity ("isPresentationMode"). Here, a bit is set corresponding to "isPresentationMode" and a bit is set corresponding to "isPeakMode." In this implementation, the largest jitter buffer size corresponds with a long delay/low loss mode. In some implementations, this mode will be indicated when network jitter dynamics data indicates more than a threshold amount of network jitter and when conversational interactivity data indicates less than a threshold amount of conversational participation by a single conversational participant. In some examples, this mode will be indicated when network jitter dynamics data indicates more than a threshold amount of network jitter and when conversational interactivity data indicates less than a threshold amount of conversational participation by all but one conversational participant, or all but a threshold number of conversational participants. In some examples, the largest jitter buffer size may be in the range of hundreds of milliseconds, e.g., 500 ms, 600 ms, 700 ms, 800 ms, 900 ms, 1000 ms, 1100 ms, 1200 ms, 1300 ms, 1400 ms, 1500 ms, etc. In some implementations, the jitter buffer size may be fixed at the largest jitter buffer size until the network jitter dynamics data and/or the conversational interactivity data indicate a change. However, in other implementations, the jitter buffer size may vary adaptively within a size range, e.g., according to parameters for a long delay/low loss mode described above.

In this implementation, the largest or the second-largest of the jitter buffer sizes (or size ranges) corresponds to network jitter dynamics data indicating less than a threshold amount of network jitter and conversational interactivity data indicating less than a threshold amount of conversational interactivity. Here, "isPresentationMode" is set to one and "isPeakMode" is set to zero. In some implementations, this mode will be indicated when network jitter dynamics data indicates less than a threshold amount of network jitter and when conversational interactivity data indicates less than a threshold amount of conversational participation by a single conversational participant. In some examples, this mode will be indicated when network jitter dynamics data indicates less than a threshold amount of network jitter and when conversational interactivity data indicates less than a threshold amount of conversational participation by all but one conversational participant, or all but a threshold number of conversational participants. In this implementation, the largest jitter buffer size corresponds with a long delay/low loss mode, which involves the largest of at least three jitter buffer sizes (or size ranges). However, in alternative implementations, this condition may correspond with the second-largest of at least four jitter buffer sizes (or size ranges), or the larger of two jitter buffer sizes (or size ranges).

In this example, the second-smallest of the jitter buffer sizes corresponds to network jitter dynamics data indicating greater than a threshold amount of network jitter and conversational interactivity data indicating at least a threshold amount of conversational interactivity. Here, "isPresentationMode" is set to zero and "isPeakMode" is set to one. In some implementations, this mode will be indicated when network jitter dynamics data indicates more than a threshold amount of network jitter and when conversational interactivity data indicates more than a threshold amount of conversational participation by a single conversational participant. In some examples, this mode will be indicated when network jitter dynamics data indicates more than a threshold amount of network jitter and when conversational interactivity data indicates more than a threshold amount of conversational participation by a threshold number of conversational participants. In this example, the second-smallest of the jitter buffer sizes corresponds to a peak mode jitter buffer size. However, in some alternative implementations, the peak mode jitter buffer size may be the same size as the jitter buffer that corresponds with a long delay/low loss mode. As noted above, in some implementations the jitter buffer size may be variable during a peak mode. In some implementations, during a peak mode the maximum talk-spurt start delay will be increased.

In some examples, asymmetric "attack" and "decay" processes of jitter buffer control may be implemented during a peak mode. As used herein, the term "attack" may correspond with a response to a delay spike (e.g., by increasing the jitter buffer size) and "decay" may correspond with a non-attack process, such as a return to a lower jitter buffer size after an attack process. For example, applying asymmetrical smoothing parameters for attack and decay processes may involve applying an attack smoothing parameter if the PDV or IAT is greater than a current jitter buffer size. In some implementations, applying asymmetrical smoothing parameters for attack and decay processes may involve applying a decay smoothing parameter if the PDV or IAT is not greater than a current jitter buffer size. In some peak mode implementations, an "attack" process of jitter buffer control can allow the jitter buffer size to increase quickly in response to instantaneous spike jitter. The jitter buffer size may be controlled to reduce the packet loss rate when there is bursty arrival of delayed packets after a delay spike.
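
One plausible realization of this asymmetry is an exponentially weighted update with two different smoothing parameters, a minimal sketch of which follows. The constants ATTACK_ALPHA and DECAY_ALPHA are illustrative assumptions, not values taken from this disclosure.

    # Assumed smoothing constants: a large attack value tracks delay
    # spikes quickly; a small decay value shrinks the buffer slowly.
    ATTACK_ALPHA = 0.9
    DECAY_ALPHA = 0.05

    def update_buffer_size(current_size_ms: float,
                           observed_delay_ms: float) -> float:
        """Update the jitter buffer size from one PDV or IAT observation,
        applying the attack parameter when the observation exceeds the
        current buffer size and the decay parameter otherwise."""
        alpha = ATTACK_ALPHA if observed_delay_ms > current_size_ms else DECAY_ALPHA
        return (1.0 - alpha) * current_size_ms + alpha * observed_delay_ms

Because the attack parameter is much larger than the decay parameter, the buffer grows almost immediately in response to a delay spike but returns to a smaller size only gradually, which helps absorb the bursty arrival of delayed packets that tends to follow a spike.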

In this implementation, the smallest of the jitter buffer sizes corresponds to network jitter dynamics data indicating less than a threshold amount of network jitter and conversational interactivity data indicating at least a threshold amount of conversational interactivity. In some implementations, this mode will be indicated when network jitter dynamics data indicates less than a threshold amount of network jitter and when conversational interactivity data indicates more than a threshold amount of conversational participation by a single conversational participant. In some examples, this mode will be indicated when network jitter dynamics data indicates less than a threshold amount of network jitter and when conversational interactivity data indicates more than a threshold amount of conversational participation by a threshold number of conversational participants. Here, "isPresentationMode" is set to zero and "isPeakMode" is set to zero, indicating a "normal mode" of jitter buffer control and a corresponding normal mode jitter buffer size range. In this example, the normal mode jitter buffer size range is smaller than the peak mode jitter buffer size range and smaller than the long delay/low loss jitter buffer size range. In some implementations, the normal mode jitter buffer size range may be in the range of tens of milliseconds, e.g., 25 ms, 30 ms, 35 ms, 40 ms, 45 ms, 50 ms, 55 ms, etc. However, the jitter buffer size may vary adaptively during a normal mode. For example, the jitter buffer size may vary adaptively according to methods known to those of skill in the art.
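
Tying the three modes to concrete numbers, the sketch below lets the size vary adaptively but clamps it to its mode's range. The long delay/low loss and normal mode bounds follow the hundreds-of-milliseconds and tens-of-milliseconds examples given above; the peak mode bounds are assumed values chosen only to fall between the two.

    # (min_ms, max_ms) per mode; the peak bounds are assumptions.
    MODE_RANGES_MS = {
        "long_delay_low_loss": (500, 1500),
        "peak": (100, 500),
        "normal": (25, 55),
    }

    def clamp_to_mode(mode: str, adaptive_size_ms: float) -> float:
        """Allow adaptive variation of the jitter buffer size, but only
        within the size range of the current control mode."""
        lo, hi = MODE_RANGES_MS[mode]
        return min(max(adaptive_size_ms, lo), hi)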

Although various aspects of the current disclosure may be implemented in a device, such as a server, which may facilitate communications between multiple communication terminals, some aspects may be implemented (at least in part) in a single communication terminal. In some such examples, determining the conversational interactivity data may involve analyzing the conversational activity of only a single conversational participant. Because such implementations may involve analyzing the conversational activity of a conversational participant who is using a particular communication terminal, the single conversational participant may be referred to as a "local" or "near-end" conversational participant.

In some such implementations, analyzing the conversational activity of the single conversational participant may involve determining whether the single conversational participant is talking or not talking. Controlling the jitter buffer size may involve setting the jitter buffer to a relatively smaller size (or size range) when the single conversational participant is talking and setting the jitter buffer to a relatively larger size (or size range) when the single conversational participant is not talking.
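
A minimal sketch of this rule follows; the two target sizes are assumed for illustration and are not specified by this disclosure.

    # Assumed target sizes for a single-terminal implementation (ms).
    TALKING_SIZE_MS = 40       # small: keeps the far end responsive
    NOT_TALKING_SIZE_MS = 400  # larger: extra delay is acceptable

    def near_end_target_size(near_end_is_talking: bool) -> int:
        """Pick a target jitter buffer size from near-end voice activity."""
        return TALKING_SIZE_MS if near_end_is_talking else NOT_TALKING_SIZE_MS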

Setting the jitter buffer to a relatively smaller size (or size range) when the near-end conversational participant is talking can allow other "far-end" conversational participants to interrupt and/or respond to the near-end conversational participant quickly. When the near-end conversational participant is not talking, this condition indicates that another conversational participant is likely to be talking or that there is a mutual silent time. In such conditions, the jitter buffer length of the near-end conversational participant can be set to a relatively larger size (or size range), because a longer delay may be acceptable.

According to some such embodiments, statistics of the talk-spurt length distribution of the near-end conversational participant may be used to control the jitter buffer algorithm. For example, if the talk-spurts of the near-end conversational participant tend to be relatively long, this may indicate that the near-end conversational participant is giving a presentation.
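
As one hedged example of such a statistic, a long average talk-spurt length could serve as a presentation indicator; the 10-second threshold below is an assumption, not a value from this disclosure.

    from statistics import mean

    PRESENTATION_MEAN_SPURT_S = 10.0  # assumed threshold

    def near_end_looks_like_presenter(talkspurt_lengths_s: list[float]) -> bool:
        """Heuristic: long average talk-spurts from the near-end
        participant suggest a presentation."""
        if not talkspurt_lengths_s:
            return False
        return mean(talkspurt_lengths_s) > PRESENTATION_MEAN_SPURT_S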

FIG. 5 is a block diagram that provides examples of components of an apparatus capable of implementing various aspects of this disclosure. The apparatus 500 may, for example, be (or may be a portion of) a communication terminal, a server, etc. In some examples, the apparatus may be implemented in a component of another device. For example, in some implementations the apparatus 500 may be a line card.

In this example, the apparatus 500 includes an interface system 505, a memory system 510 and a logic system 515. The logic system 515 and/or the memory system 510 may be capable of establishing one or more jitter buffers in the memory system 510. The interface system 505 may include a network interface, an interface between the logic system and the memory system and/or an external device interface (such as a universal serial bus (USB) interface). The logic system 515 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.

In this example, the logic system 515 is capable of receiving audio data via the interface system. The audio data may include audio packets received at actual packet arrival times during a time interval that corresponds with a conversation analysis segment. The time interval may be a long-term time interval that includes a plurality of talkspurts. The logic system 515 may be capable of analyzing the audio data to determine network jitter dynamics data and conversational interactivity data and of controlling a jitter buffer size according to the network jitter dynamics data and the conversational interactivity data.

Analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of packet delay variation (PDV) or inter-arrival time (IAT) variation by comparing expected packet arrival times with the actual packet arrival times. Analyzing the audio data to determine the network jitter dynamics data may involve determining at least one of a delay spike presence probability or a delay spike intensity.
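
Both quantities can be derived from packet timestamps alone. In the sketch below, the expected arrival times are anchored at the first packet's receive time and spaced one frame interval apart; the 20 ms frame interval is an assumption.

    def iat_variation(arrival_times_ms: list[float]) -> list[float]:
        """Inter-arrival time (IAT) variation: the receive-time
        differences between adjacent packets."""
        return [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]

    def packet_delay_variation(arrival_times_ms: list[float],
                               frame_interval_ms: float = 20.0) -> list[float]:
        """Packet delay variation (PDV): actual arrival time minus the
        expected arrival time, anchored at the first packet."""
        anchor = arrival_times_ms[0]
        return [actual - (anchor + i * frame_interval_ms)
                for i, actual in enumerate(arrival_times_ms)]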

Analyzing the audio data to determine the conversational interactivity data may involve determining single-talk times, double-talk times and mutual silent times. In some implementations, analyzing the audio data to determine the conversational interactivity data may involve determining a conversational interactivity measure (CIM) based on heuristic rules and/or conversational relative entropy.
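
This disclosure does not fix a formula for the CIM. As one possibility for the two-party case, per-frame voice activity decisions could be classified into the three categories and combined heuristically, as in the sketch below; both the per-frame classification granularity and the weights are illustrative assumptions.

    def interactivity_fractions(talk_a: list[bool],
                                talk_b: list[bool]) -> tuple[float, float, float]:
        """Classify each frame of a two-party conversation as single-talk,
        double-talk or mutual silence, returning the three fractions.
        talk_a and talk_b are per-frame voice activity decisions."""
        n = len(talk_a)
        if n == 0:
            return 0.0, 0.0, 0.0
        single = sum(a != b for a, b in zip(talk_a, talk_b)) / n
        double = sum(a and b for a, b in zip(talk_a, talk_b)) / n
        silent = sum(not a and not b for a, b in zip(talk_a, talk_b)) / n
        return single, double, silent

    def heuristic_cim(single: float, double: float, silent: float) -> float:
        """A hypothetical heuristic CIM: double-talk (interruption) reads
        as the strongest sign of interactivity, mutual silence the weakest."""
        return 0.5 * double + 0.4 * single + 0.1 * (1.0 - silent)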

FIG. 6 is a block diagram that provides examples of components of an audio processing apparatus. In this example, the device 600 includes an interface system 605. The interface system 605 may include a network interface, such as a wireless network interface. Alternatively, or additionally, the interface system 605 may include a universal serial bus (USB) interface or another such interface.

The device 600 includes a logic system 610. The logic system 610 may include a processor, such as a general purpose single- or multi-chip processor. The logic system 610 may include a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, or combinations thereof. The logic system 610 may be configured to control the other components of the device 600. For example, the logic system 610 may be capable of controlling a size of one or more jitter buffers in the memory system 615. Although no interfaces between the components of the device 600 are shown in FIG. 6, the logic system 610 may be configured with interfaces for communication with the other components. The other components may or may not be configured for communication with one another, as appropriate.

The logic system 610 may be configured to perform audio data analysis and jitter buffer control functionality, including but not limited to the functionality described herein. In some such implementations, the logic system 610 may be configured to operate (at least in part) according to software stored on one or more non-transitory media. The non-transitory media may include memory associated with the logic system 610, such as random access memory (RAM) and/or read-only memory (ROM). The non-transitory media may include memory of the memory system 615. The memory system 615 may include one or more suitable types of non-transitory storage media, such as flash memory, a hard drive, etc.

The logic system 610 may be configured to receive frames of encoded audio data via the interface system 605 and to decode the encoded audio data. Alternatively, or additionally, the logic system 610 may be configured to receive frames of encoded audio data via an interface between the memory system 615 and the logic system 610. The logic system 610 may be configured to control the speaker(s) 620 according to decoded audio data.

The display system 630 may include one or more suitable types of display, depending on the manifestation of the device 600. For example, the display system 630 may include a liquid crystal display, a plasma display, a bistable display, etc.

The user input system 635 may include one or more devices configured to accept input from a user. In some implementations, the user input system 635 may include a touch screen that overlays a display of the display system 630. The user input system 635 may include a mouse, a track ball, a gesture detection system, a joystick, one or more GUIs and/or menus presented on the display system 630, buttons, a keyboard, switches, etc. In some implementations, the user input system 635 may include the microphone 625; a user may provide voice commands for the device 600 via the microphone 625. The logic system may be configured for speech recognition and for controlling at least some operations of the device 600 according to such voice commands.

The power system 640 may include one or more suitable energy storage devices, such as a nickel-cadmium battery or a lithium-ion battery. The power system 640 may be configured to receive power from an electrical outlet.

Various modifications to the implementations described in this disclosure may be readily apparent to those having ordinary skill in the art. The general principles defined herein may be applied to other implementations without departing from the spirit or scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.

CLAIMS

1. A method, comprising: receiving audio data frames, the audio data frames included in audio data packets received at actual packet arrival times during a time interval that corresponds with a conversation analysis segment; analyzing the audio data of the conversation analysis segment to determine network jitter dynamics data and conversational interactivity data, wherein the network jitter dynamics data provides an indication of jitter in a network that relays the audio data packets and wherein the conversational interactivity data provides an indication of interactivity between participants of a conversation represented by the audio data frames; selecting, based on the network jitter dynamics data and the conversational interactivity data of the audio data, a jitter buffer control mode from at least three jitter buffer control modes, wherein each jitter buffer control mode corresponds to a range of jitter buffer sizes; and determining a jitter buffer size based on the selected jitter buffer control mode.

2. The method of claim 1, wherein determining the jitter buffer size comprises adaptively varying the jitter buffer size within the range of sizes for the selected jitter buffer control mode.

3. The method of claim 1, wherein the at least three jitter buffer control modes include a peak mode, a low-loss mode and a normal mode, wherein the peak mode is determined by using inter-percentile range (IPR) calculations, delay spike intensity and/or spike presence probability, wherein the low-loss mode is related to a presentation mode, and wherein the normal mode is a non-peak mode and non-presentation mode.

4. An apparatus configured to perform the method of claim 1.

5. A computer program product having instructions which, when executed by a computing device or system, cause said computing device or system to perform the method of claim 1.